Trends in Data Science & Business Analytics

Random Forest Classification for ML/Data Science Requirement

import pandas as pd

df = pd.read_parquet("data/eda.parquet", engine='pyarrow')
df.columns
Index(['COMPANY', 'LOCATION', 'POSTED', 'MIN_EDULEVELS_NAME',
       'MAX_EDULEVELS_NAME', 'MIN_YEARS_EXPERIENCE', 'MAX_YEARS_EXPERIENCE',
       'TITLE', 'SKILLS', 'SPECIALIZED_SKILLS', 'CERTIFICATIONS',
       'COMMON_SKILLS', 'SOFTWARE_SKILLS', 'SOC_2021_4_NAME', 'NAICS_2022_6',
       'NAICS2_NAME', 'REMOTE_TYPE_NAME', 'SALARY', 'TITLE_NAME',
       'SKILLS_NAME', 'SPECIALIZED_SKILLS_NAME', 'BODY'],
      dtype='object')
ml_keywords = ["machine learning", "data science", "ai", "artificial intelligence", "deep learning", "data scientist"]

def requires_ml(skills):
    if pd.isnull(skills):
        return 0
    skills = skills.lower()
    return int(any(kw in skills for kw in ml_keywords))

df["REQUIRES_ML"] = df["SKILLS_NAME"].apply(requires_ml)
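A self-contained sketch of the labeling rule above, run on hypothetical skill strings, shows how the substring match behaves. Note that the bare "ai" keyword also matches inside longer words such as "maintain", a known false-positive risk of plain substring matching:

```python
import pandas as pd

ml_keywords = ["machine learning", "data science", "ai", "artificial intelligence", "deep learning", "data scientist"]

def requires_ml(skills):
    # Missing skill lists are treated as "no ML requirement"
    if pd.isnull(skills):
        return 0
    skills = skills.lower()
    # Case-insensitive substring match against any keyword
    return int(any(kw in skills for kw in ml_keywords))

print(requires_ml("Python, Machine Learning, SQL"))  # 1
print(requires_ml("Accounting, Payroll"))            # 0
print(requires_ml(None))                             # 0
print(requires_ml("maintain servers"))               # 1 -- "ai" matches inside "maintain"
```

A word-boundary regex would avoid the "maintain" case, but would also change the labels produced above, so the simple substring rule is kept as-is here.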
features = ["TITLE", "SOC_2021_4_NAME", "NAICS2_NAME", "MIN_EDULEVELS_NAME", "MIN_YEARS_EXPERIENCE"]
target = "REQUIRES_ML"

df = df[features + [target, 'BODY']].dropna()
from sklearn.preprocessing import LabelEncoder

df_encoded = df.copy()
label_encoders = {}

for col in features:
    if df_encoded[col].dtype == "object":
        le = LabelEncoder()
        df_encoded[col] = le.fit_transform(df_encoded[col])
        label_encoders[col] = le
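LabelEncoder assigns each category an arbitrary integer code (following sorted category order), and the `label_encoders` dict kept above lets us map codes back to the original strings. A minimal illustration on toy titles, not the actual job postings:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Codes follow sorted category order: Analyst=0, Data Scientist=1, Engineer=2
codes = le.fit_transform(["Data Scientist", "Analyst", "Data Scientist", "Engineer"])
print(codes)                      # [1 0 1 2]
print(le.inverse_transform([1]))  # ['Data Scientist']
```

Tree-based models such as random forests can split on these integer codes even though the ordering is meaningless; for linear models, one-hot encoding would usually be preferred.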
from sklearn.model_selection import train_test_split

X = df_encoded[features]
y = df_encoded[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.76      0.77      0.77      7991
           1       0.70      0.69      0.70      6278

    accuracy                           0.74     14269
   macro avg       0.73      0.73      0.73     14269
weighted avg       0.74      0.74      0.74     14269
import plotly.express as px
fig = px.bar(
    x=rf.feature_importances_,
    y=features,
    orientation='h',
    labels={'x': 'Importance', 'y': 'Feature'},
    title='Feature Importance – ML Role Classification'
)

fig.update_layout(
    yaxis=dict(categoryorder='total ascending'),
    margin=dict(l=100, r=20, t=50, b=20),
    height=500,
    template='plotly_white'
)

fig.write_html(
    'figures/rm_model_plot1.html',
    include_plotlyjs='cdn',
    full_html=False
)

This bar chart displays the feature importance scores from a random forest model predicting whether a job role involves ML/Data Science. The most influential feature by far is the job title (TITLE), which has substantially higher importance than all other variables. Secondary contributors include industry classification (NAICS2_NAME) and minimum years of experience, while education level and SOC code had relatively little influence on the model's predictions. This suggests that the job title alone carries strong predictive power for identifying ML-related roles.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Cleaned job descriptions
df['BODY_clean'] = df['BODY'].fillna("").str.lower()

# Target
y = df['REQUIRES_ML']  # binary 0/1 target created above

# TF-IDF vectorization
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
X = tfidf.fit_transform(df['BODY_clean'])
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Default test_size=0.25; stratify preserves the class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.82      0.90      0.86     10000
           1       0.86      0.75      0.80      7836

    accuracy                           0.84     17836
   macro avg       0.84      0.83      0.83     17836
weighted avg       0.84      0.84      0.83     17836
import numpy as np
importances = model.feature_importances_
top_idx = np.argsort(importances)[-20:]
top_words = tfidf.get_feature_names_out()[top_idx]
top_importances = importances[top_idx]
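The `np.argsort(importances)[-20:]` idiom works because argsort returns indices in ascending order of value, so slicing the last positions selects the highest-importance features. A small demonstration of the pattern on a toy array:

```python
import numpy as np

scores = np.array([0.1, 0.5, 0.2, 0.9, 0.3])
# argsort gives indices sorted ascending by score: [0 2 4 1 3]
top3 = np.argsort(scores)[-3:]  # indices of the three largest, still ascending
print(top3)          # [4 1 3]
print(scores[top3])  # [0.3 0.5 0.9]
```

The ascending order is convenient here: Plotly then draws the largest bar at the top once `categoryorder='total ascending'` is applied.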

fig = px.bar(
    x=top_importances,
    y=top_words,
    orientation='h',
    labels={'x': 'Importance', 'y': 'Word'},
    title='Top 20 TF-IDF Words for ML Role Classification'
)

fig.update_layout(
    yaxis={'categoryorder':'total ascending'},
    margin=dict(l=120, r=20, t=60, b=20),
    width=800, height=600, template='plotly_white'
)

fig.write_html(
    'figures/rm_model_plot2.html',
    include_plotlyjs='cdn',
    full_html=False
)

This bar chart shows the top words contributing to the classification of job roles as Machine Learning (ML)-related based on job description text. Surprisingly, the most influential words are "attention," "chain," and "supply," which could indicate overlap with supply chain roles or reflect noise in the model. More expected terms such as "machine," "learning," "python," "ai," and "analytics" also appear, reinforcing that relevant technical language still plays a role in identifying ML-related positions. The presence of general words like "strong" and "communication" suggests that not all influential terms are strictly technical.

from sklearn.metrics import confusion_matrix
import plotly.figure_factory as ff

cm = confusion_matrix(y_test, y_pred)
labels = [str(c) for c in model.classes_]

fig = ff.create_annotated_heatmap(
    z=cm,
    x=labels,
    y=labels,
    colorscale='Blues',
    showscale=True,
    annotation_text=cm,
    hoverinfo='z'
)

fig.update_layout(
    title='Confusion Matrix – ML Role Classification',
    xaxis_title='Predicted Label',
    yaxis_title='Actual Label',
    xaxis=dict(tickmode='array', tickvals=list(range(len(labels))), ticktext=labels),
    yaxis=dict(tickmode='array', tickvals=list(range(len(labels))), ticktext=labels),
    width=700,
    height=600,
    template='plotly_white',
    margin=dict(l=80, r=20, t=60, b=80)
)

fig.write_html(
    "figures/rm_model_plot3.html",
    include_plotlyjs='cdn',
    full_html=False
)
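For a binary problem, `confusion_matrix` returns a 2x2 array with rows indexed by actual label and columns by predicted label, which is exactly the layout the heatmap above visualizes. A toy example with hypothetical labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]    row 0 (actual 0): 2 true negatives, 1 false positive
#  [1 2]]   row 1 (actual 1): 1 false negative, 2 true positives
```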

We selected a combination of structured and unstructured features to predict whether a job role requires Machine Learning or Data Science. Structured features such as TITLE, SOC_2021_4_NAME, NAICS2_NAME, MIN_EDULEVELS_NAME, and MIN_YEARS_EXPERIENCE were chosen for their domain relevance: these fields reflect the role's function, industry, required education, and experience level, all of which can signal ML-related requirements. We also included the job description BODY text, applying TF-IDF vectorization to extract key terms, which allowed the model to learn from nuanced language patterns within postings. Feature importance and performance metrics confirm that both structured metadata and free-text data contribute meaningfully to classification accuracy.